Skip to content

[SPARK-9793] [MLlib] [PySpark] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly#8166

Closed
yanboliang wants to merge 7 commits into
apache:masterfrom
yanboliang:spark-9793
Closed

[SPARK-9793] [MLlib] [PySpark] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly#8166
yanboliang wants to merge 7 commits into
apache:masterfrom
yanboliang:spark-9793

Conversation

@yanboliang

Copy link
Copy Markdown
Contributor

PySpark DenseVector, SparseVector __eq__ method should use semantics equality, and DenseVector can compared with SparseVector.
Implement PySpark DenseVector, SparseVector __hash__ method based on the first 16 entries. That will make PySpark Vector objects can be used in collections.

@SparkQA

SparkQA commented Aug 13, 2015

Copy link
Copy Markdown

Test build #40766 has finished for PR 8166 at commit 1b4ed66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang

Copy link
Copy Markdown
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 15, 2015

Copy link
Copy Markdown

Test build #40949 has finished for PR 8166 at commit 2a85d09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang yanboliang changed the title [SPARK-9793] [MLlib] [PySpark] PySpark DenseVector, SparseVector __eq__ should use semantics [SPARK-9793] [MLlib] [PySpark] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly Aug 15, 2015

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since k1 will be at most == v1_size due to the earlier while, checking for == here will suffice and is easier to read

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto for k2

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think checking k1 >= v1_size is more robust than k1 == v1_size, and Scala code also use the former one.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's fine with me

@feynmanliang

Copy link
Copy Markdown
Contributor

LGTM after docstring change

@SparkQA

SparkQA commented Aug 27, 2015

Copy link
Copy Markdown

Test build #41666 has finished for PR 8166 at commit d63d54e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread python/pyspark/mllib/linalg/__init__.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it return False?

@mengxr

mengxr commented Sep 11, 2015

Copy link
Copy Markdown
Contributor

@yanboliang Please update the PR to use the first 128 nonzeros entries to compute hash.

@SparkQA

SparkQA commented Sep 14, 2015

Copy link
Copy Markdown

Test build #42420 has finished for PR 8166 at commit 3b8ac7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make the code more readable:

if isnan(value):
  value = float('nan')
return struct.unpack('Q', struct.pack('d', value))[0]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants